
⚡ Bolt: Combine keyword delimiters into single regex pattern#283

Closed
bashandbone wants to merge 3 commits into main from bolt/combine-regex-patterns-15926007155550964703

Conversation


@bashandbone (Contributor) commented Mar 28, 2026

💡 What: Optimize the _match_keyword_delimiters method by combining multiple keyword delimiters into a single compiled regex pattern.
🎯 Why: Previously, the engine iterated over every registered keyword delimiter and scanned the entire document with a separate re.finditer call for each one, giving O(K * N) complexity for K delimiters over N characters. Combining them with alternation ((?:...)) and looking up the matching delimiter in a dictionary lets the regex engine scan the document exactly once, reducing the text scanning to O(N).
📊 Impact: Considerably faster document text chunking time due to fewer string reads, especially for large source code files with multiple delimiters.
🔬 Measurement: Verify tests run smoothly without regressions on the chunker code via uv run pytest tests/unit/engine/chunker.
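As a rough illustration of the technique described above (using hypothetical keywords and content, not the project's actual delimiter set), a single alternation pattern finds the same matches as per-keyword scans while reading the text only once:

```python
import re

content = "def foo():\n    pass\n\nclass Bar:\n    def baz(self):\n        pass\n"
keywords = ["def", "class"]

# Per-keyword approach: one full scan of `content` per keyword -> O(K * N)
separate = sorted(
    (m.start(), m.group(0))
    for kw in keywords
    for m in re.finditer(rf"\b{re.escape(kw)}\b", content)
)

# Combined approach: a single alternation pattern scans `content` once -> O(N)
combined_pattern = rf"\b(?:{'|'.join(re.escape(kw) for kw in keywords)})\b"
combined = [(m.start(), m.group(0)) for m in re.finditer(combined_pattern, content)]

# Both approaches yield identical (position, keyword) pairs
assert combined == separate
```

The `re.escape` calls matter here: delimiter starts that contain regex metacharacters would otherwise corrupt the combined pattern.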


PR created automatically by Jules for task 15926007155550964703 started by @bashandbone

Summary by Sourcery

Optimize keyword delimiter matching in the chunker to reduce redundant regex scans and improve performance when processing source content.

Enhancements:

  • Combine multiple keyword delimiters into a single compiled regex pattern to scan the document once instead of per-delimiter iterations.
  • Skip keyword delimiter processing early when no non-empty keyword delimiters are present.

Documentation:

  • Fix a minor typo in the language normalization parameter description in the delimiter loader.

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
@google-labs-jules (Contributor)

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

Copilot AI review requested due to automatic review settings March 28, 2026 12:37

sourcery-ai bot commented Mar 28, 2026

Reviewer's Guide

Optimizes keyword delimiter matching in the chunker by building a single alternation regex over all keyword delimiters, short‑circuiting when no non‑empty keywords exist, and fixing a minor docstring typo.

Flow diagram for optimized _match_keyword_delimiters keyword scanning

```mermaid
flowchart TD
    A_start["Start _match_keyword_delimiters"] --> B_filter

    B_filter["Filter keyword_delimiters where start is non_empty"] --> C_any

    C_any{"keyword_delimiters list is empty?"} -->|yes| Z_return_empty["Return matches (no keyword delimiters)"]
    C_any -->|no| D_build_struct_pairs

    D_build_struct_pairs["Initialize structural_pairs map"] --> E_build_map

    E_build_map["Build delimiter_map from start_text to delimiter_instance"] --> F_build_regex

    F_build_regex["Build combined_pattern with alternation over all delimiter.start values"] --> G_finditer

    G_finditer["Iterate over re.finditer on combined_pattern and content"] --> H_keyword_pos

    H_keyword_pos["For each match: compute keyword_pos and matched_text"] --> I_lookup_delim

    I_lookup_delim["Lookup delimiter in delimiter_map using matched_text"] --> J_inside_comment

    J_inside_comment{"_is_inside_string_or_comment at keyword_pos?"} -->|yes| G_finditer
    J_inside_comment -->|no| K_find_struct_open

    K_find_struct_open["_find_next_structural_with_char starting after keyword"] --> L_struct_found

    L_struct_found{"struct_start is None?"} -->|yes| G_finditer
    L_struct_found -->|no| M_find_struct_close

    M_find_struct_close["_find_matching_close using struct_start and structural_pairs"] --> N_struct_end

    N_struct_end{"struct_end is None?"} -->|yes| G_finditer
    N_struct_end -->|no| O_calc_nesting

    O_calc_nesting["_calculate_nesting_level at keyword_pos"] --> P_append_match

    P_append_match["Append DelimiterMatch(delimiter, start_pos, end_pos, nesting_level) to matches"] --> G_finditer

    G_finditer -->|no more matches| Q_return_matches

    Q_return_matches["Return matches with all keyword-based DelimiterMatch instances"] --> R_end["End _match_keyword_delimiters"]
```

File-Level Changes

Change: Optimize keyword delimiter scanning to run a single regex over all keywords instead of per‑keyword scans.
Files: src/codeweaver/engine/chunker/delimiter.py

  • Return early when the filtered keyword delimiter list is empty to avoid unnecessary setup and iteration.
  • Build a map from keyword text to its Delimiter object and construct one combined word-boundary regex using alternation over all keyword starts.
  • Replace the outer loop over keyword delimiters with a single loop over regex matches from the combined pattern, looking up the corresponding Delimiter from the map.
  • Preserve the existing logic for skipping matches in strings/comments, finding the next structural token, matching closing delimiters, computing nesting level, and appending DelimiterMatch instances.

Change: Minor documentation and style cleanup.
Files: src/codeweaver/engine/chunker/delimiter.py

  • Correct a parameter docstring from 'normalised' to 'normalized' in _load_custom_delimiters.

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@sourcery-ai (bot) left a comment


Hey - I've found 1 issue


Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines 474 to 477:

```python
            # Skip if keyword is inside a string or comment
            if self._is_inside_string_or_comment(content, keyword_pos):
                continue
```
Copy link
Copy Markdown
Contributor


suggestion (performance): Avoid recreating the allowed set on each iteration by precomputing it once outside the loop.

Since structural_pairs doesn’t change within this function, compute allowed_structurals = set(structural_pairs.keys()) once before the for match in re.finditer(...) loop and reuse it. This keeps the per-match path leaner, which matters on large files with many keyword hits.

Suggested implementation:

```python
            struct_start, struct_char = self._find_next_structural_with_char(
                content,
                start=keyword_pos + len(delimiter.start),
                allowed=allowed_structurals,
```

To fully implement the optimization, define the allowed_structurals set once before the for match in re.finditer(...): loop that contains this snippet. For example:

  • Just above the loop (and after structural_pairs is defined), add:
    allowed_structurals = set(structural_pairs.keys())

Ensure that this variable is in the same scope as the loop so it is available where _find_next_structural_with_char is called.


Copilot AI left a comment


Pull request overview

Optimizes delimiter-based chunking by scanning the document once for all keyword-based delimiters, rather than running a separate regex scan per keyword delimiter.

Changes:

  • Combine keyword delimiter starts into a single alternation regex and map match text back to a delimiter.
  • Add an early return when keyword delimiters are empty after filtering.
  • Fix docstring spelling (“normalised” → “normalized”).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +465 to +473:

```diff
+# Optimization: combining multiple keyword delimiters into a single compiled regex pattern
+# using alternation (?:...) avoids looping over keywords with individual re.finditer calls.
+delimiter_map = {d.start: d for d in keyword_delimiters}
+combined_pattern = rf"\b(?:{'|'.join(re.escape(d.start) for d in keyword_delimiters)})\b"

-for match in re.finditer(pattern, content):
-    keyword_pos = match.start()
+for match in re.finditer(combined_pattern, content):
+    keyword_pos = match.start()
+    matched_text = match.group(0)
+    delimiter = delimiter_map[matched_text]
```
Copilot AI commented Mar 28, 2026

delimiter_map = {d.start: d ...} will silently drop keyword delimiters that share the same start string (later entries overwrite earlier ones). This codebase has overlapping keyword starts (e.g., both STRUCT_PATTERN and TYPE_ALIAS_PATTERN include "type"), so after this change only one delimiter kind will ever be produced for type matches, changing chunking behavior. Use a mapping from start string to a list of delimiters (or otherwise disambiguate) and emit matches for each applicable delimiter to preserve prior behavior.

Comment on lines +465 to +470:

```diff
+# Optimization: combining multiple keyword delimiters into a single compiled regex pattern
+# using alternation (?:...) avoids looping over keywords with individual re.finditer calls.
+delimiter_map = {d.start: d for d in keyword_delimiters}
+combined_pattern = rf"\b(?:{'|'.join(re.escape(d.start) for d in keyword_delimiters)})\b"

-for match in re.finditer(pattern, content):
-    keyword_pos = match.start()
+for match in re.finditer(combined_pattern, content):
```
Copilot AI commented Mar 28, 2026

This optimization changes how keyword delimiters are selected when multiple delimiter kinds share the same keyword (e.g., type). There doesn’t appear to be a unit test exercising overlapping keyword starts for delimiter-based chunking; adding a regression test (e.g., a snippet containing type Foo struct { ... } and/or type Foo = ...) would help prevent future regressions when refactoring this matcher.

@github-actions (bot) left a comment


Review of PR #283: Combine keyword delimiters into single regex pattern

Summary

This PR optimizes the _match_keyword_delimiters method by combining multiple keyword delimiter patterns into a single regex using alternation. This is a smart optimization that reduces the complexity from O(K * N) to O(N), where K is the number of keyword delimiters and N is the document size.

Positive Changes

  1. Performance Improvement: The optimization is well-motivated and should provide significant performance gains for large files with multiple keyword delimiters.

  2. Code Quality:

    • Clear explanatory comment added (lines 465-466)
    • Early return optimization for empty delimiter lists (lines 454-455)
    • Logic remains functionally equivalent to the original
  3. Documentation: Minor typo fix ("normalised" → "normalized") at line 1289

Critical Issue Found

⚠️ Bug: Dictionary key collision in delimiter_map

At line 467:

```python
delimiter_map = {d.start: d for d in keyword_delimiters}
```

This dictionary comprehension will silently overwrite delimiters if multiple delimiters have the same start string. I found at least two cases where this occurs:

  1. "type" appears in:

    • STRUCT_PATTERN (line 146: starts=["struct", "type"])
    • TYPE_ALIAS_PATTERN (line 161: starts=["type", "typedef", "typealias"])
  2. "extension" appears in:

    • MODULE_PATTERN (line 153: starts=["module", "namespace", "package", "mod", "extension"])
    • EXTENSION_PATTERN (line 167: starts=["extension", "extend"])

Since ends="ANY" expands to empty strings (making these keyword delimiters), these patterns will have the same start but different kind, priority, or other properties. The current code will only keep the last one in the list.

Recommendation

The fix should handle duplicate start strings by either:

  1. Use a list of delimiters per key:

     ```python
     from collections import defaultdict
     delimiter_map = defaultdict(list)
     for d in keyword_delimiters:
         delimiter_map[d.start].append(d)
     ```

     Then in the loop, iterate over delimiter_map[matched_text] to create multiple matches.

  2. Add validation to ensure no duplicates exist (if duplicates aren't expected by design)

  3. Sort by priority and only keep the highest priority delimiter for each start string (if this is the intended behavior)

The original code didn't have this issue because it iterated over all delimiters individually, so each one was processed.
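The collision described above is easy to reproduce with plain dictionaries. A minimal sketch (the `Delim` class below is a hypothetical stand-in for the project's Delimiter, not its actual definition):

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Delim:
    # Hypothetical simplification of the project's Delimiter class
    start: str
    kind: str


# Two delimiter kinds sharing the same start keyword, as with "type"
keyword_delimiters = [Delim("type", "struct"), Delim("type", "type_alias")]

# Dict comprehension: the second "type" entry silently overwrites the first
flat_map = {d.start: d for d in keyword_delimiters}
assert flat_map["type"].kind == "type_alias"  # the struct delimiter is lost

# defaultdict(list): both delimiters survive, so each can emit a match
delimiter_map: defaultdict[str, list[Delim]] = defaultdict(list)
for d in keyword_delimiters:
    delimiter_map[d.start].append(d)
assert [d.kind for d in delimiter_map["type"]] == ["struct", "type_alias"]
```

This mirrors why the list-per-key shape preserves the original per-delimiter behavior while still allowing a single combined scan.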

Test Status

  • Most CI checks passing
  • Python 3.14t experimental test failing (appears unrelated to this change)
  • Recommend adding a test case that verifies behavior when multiple delimiters share the same start string

Verdict: ❌ Needs changes before merging due to the dictionary collision bug.

@github-actions (bot) left a comment


Found a critical bug where duplicate keyword start strings cause dictionary collisions. See my detailed review comment for full analysis and recommendations.

@github-actions (bot)

Additional Finding on Dictionary Collision Bug

I want to emphasize the critical bug identified by @Copilot at src/codeweaver/engine/chunker/delimiter.py:467

The delimiter_map = {d.start: d for d in keyword_delimiters} creates a dictionary collision issue. I independently verified this and found:

Conflicting Keywords:

  1. "type" - appears in both:

    • STRUCT_PATTERN (patterns.py:146)
    • TYPE_ALIAS_PATTERN (patterns.py:161)
  2. "extension" - appears in both:

    • MODULE_PATTERN (patterns.py:153)
    • EXTENSION_PATTERN (patterns.py:167)

Impact: Only the last delimiter in the list will be processed for these keywords, silently changing chunking behavior and potentially missing important code structures.

Recommended Fix:

```python
from collections import defaultdict

# Build map of start strings to lists of delimiters
delimiter_map = defaultdict(list)
for d in keyword_delimiters:
    delimiter_map[d.start].append(d)

combined_pattern = rf"\b(?:{'|'.join(re.escape(start) for start in delimiter_map.keys())})\b"

for match in re.finditer(combined_pattern, content):
    keyword_pos = match.start()
    matched_text = match.group(0)

    # Process all delimiters with this start string
    for delimiter in delimiter_map[matched_text]:
        # Skip if keyword is inside a string or comment
        if self._is_inside_string_or_comment(content, keyword_pos):
            continue
        # ... rest of existing logic
```

This maintains the O(N) performance optimization while correctly handling all matching delimiters.

@github-actions (bot) left a comment


Code Review - PR #283

Summary

This PR optimizes the _match_keyword_delimiters method by combining multiple keyword delimiters into a single regex pattern, reducing complexity from O(K × N) to O(N) where K is the number of keyword delimiters and N is the document size.

✅ Strengths

  1. Correct optimization approach: Combining multiple regex patterns using alternation ((?:pattern1|pattern2|...)) is a well-established optimization technique that reduces the number of passes over the input string.

  2. Functionally equivalent: The logic correctly preserves the original behavior by:

    • Using match.group(0) to get the matched keyword text
    • Looking up the corresponding delimiter from delimiter_map
    • Maintaining the same filtering and structural matching logic
  3. Performance improvement: This change will significantly improve performance for large files with many keyword delimiters, especially since the regex engine only scans the document once instead of K times.

  4. Early exit optimization: The addition of the early exit check (lines 454-455) is a good defensive programming practice.

  5. Minor typo fix: Fixed "normalised" to "normalized" for consistency with American English used elsewhere in the codebase.

⚠️ Potential Issues

1. Duplicate delimiter handling (Medium Severity)

The code uses a dictionary to map delimiter start strings to delimiter objects:

```python
delimiter_map = {d.start: d for d in keyword_delimiters}
```

If multiple delimiters have the same start string, only the last one will be retained in the map. This could lead to:

  • Lost delimiter configurations: If custom delimiters override built-in ones with the same start but different properties (kind, priority, etc.), only one will be used
  • Inconsistent behavior: The behavior would depend on the order of delimiters in the list

Analysis: Based on the deduplication logic in _load_delimiters_for_language (lines 1381-1385), this scenario should be rare since the code checks for (start, end, kind) uniqueness. However, it's theoretically possible for keyword delimiters (which have empty end strings) to have the same start with different kind values.

Recommendation: Consider one of these approaches:

  1. Add assertion: Add a debug assertion to verify no duplicate starts exist in keyword_delimiters
  2. Use list comprehension: Keep only unique delimiter starts using a seen set
  3. Document assumption: Add a comment documenting that duplicate starts are not expected

Example fix:

```python
# Build delimiter map - ensure no duplicate starts (each start should be unique)
seen_starts = set()
unique_delimiters = []
for d in keyword_delimiters:
    if d.start not in seen_starts:
        seen_starts.add(d.start)
        unique_delimiters.append(d)
delimiter_map = {d.start: d for d in unique_delimiters}
```

2. Regex pattern efficiency with duplicates (Low Severity)

If there are duplicate starts in keyword_delimiters, the combined pattern will contain redundant alternatives:

```python
combined_pattern = rf"\b(?:{'|'.join(re.escape(d.start) for d in keyword_delimiters)})\b"
# Could produce: \b(?:function|function|def)\b
```

This is inefficient but functionally correct. The same deduplication fix above would address this.
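The deduplication point can be sketched in a few lines (the keyword list here is hypothetical): building the alternation from unique start strings rather than the raw delimiter list drops the redundant alternatives, since `dict.fromkeys` preserves insertion order while removing repeats:

```python
import re

# Hypothetical delimiter starts with a duplicate entry
starts = ["function", "function", "def"]

# Built from the raw list, the pattern repeats "function"
redundant = rf"\b(?:{'|'.join(re.escape(s) for s in starts)})\b"
assert redundant == r"\b(?:function|function|def)\b"

# Built from unique starts, the repeat disappears
deduped = rf"\b(?:{'|'.join(re.escape(s) for s in dict.fromkeys(starts))})\b"
assert deduped == r"\b(?:function|def)\b"
```

Both patterns match the same text; the deduplicated one is just smaller for the regex engine to work with.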

📝 Minor Observations

  1. Comment quality: The optimization comment (lines 465-466) clearly explains the benefit. Well done!

  2. Code style: The change maintains consistency with the surrounding code style.

  3. Testing: The PR description mentions running tests via uv run pytest tests/unit/engine/chunker. The existing integration tests should catch any behavioral regressions.

🎯 Recommendation

Approve with minor suggestion: The optimization is sound and provides real performance benefits. The potential duplicate handling issue is theoretical and may not occur in practice due to existing deduplication logic. However, I recommend adding either:

  • A comment documenting the assumption that keyword_delimiters contains unique start strings, OR
  • Defensive code to deduplicate by start string

The CI failure for Python 3.14t appears to be experimental and can be safely ignored based on the workflow configuration (continue-on-error: true for experimental versions).

Verdict

LGTM with minor suggestion - The optimization is correct and valuable. Consider adding duplicate handling or documentation for robustness.

@bashandbone (Author)

@jules please address the identified key collision bug.

See the suggestion below; another option would be to use tuples.

> Additional Finding on Dictionary Collision Bug
>
> I want to emphasize the critical bug identified by @copilot at src/codeweaver/engine/chunker/delimiter.py:467 […]

@google-labs-jules (Contributor)

> @jules please address the identified key collision bug.
>
> See the suggestion below; another option would be to use tuples. […]

Thank you for finding this collision bug. I've updated the implementation to use a defaultdict(list) for the delimiter_map. Now, instead of blindly returning the last defined delimiter, the loop will properly iterate over all delimiters that share the same start string to ensure no code structures are missed while keeping the O(N) performance benefit.

google-labs-jules bot and others added 2 commits March 28, 2026 19:53
Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
…sition lookups

- Move 'from collections import defaultdict' from inline function body to module-level imports
- Add type annotation: defaultdict[str, list[Delimiter]]
- Check _is_inside_string_or_comment once per match position (not once per delimiter)
- Call _find_next_structural_with_char and _find_matching_close once per position
- Use matches.extend() with a generator instead of an append loop
Copilot AI review requested due to automatic review settings March 28, 2026 20:00
@bashandbone requested a review due to automatic review settings March 28, 2026 20:00

Copilot AI commented Mar 28, 2026

> @jules please address the identified key collision bug.
>
> See the suggestion below; another option would be to use tuples.
> ...

Fixed in commit 1a98b39. Replaced {d.start: d for d in keyword_delimiters} with defaultdict(list) so all delimiters sharing a start string (e.g., both STRUCT and TYPE_ALIAS for "type", both MODULE and EXTENSION for "extension") produce DelimiterMatch entries.

Additionally moved the defaultdict import to module level, added a type annotation, and hoisted the string/comment check and structural-finding calls out of the inner delimiter loop (they're positional — same result for all delimiters at the same match).

@bashandbone
Copy link
Copy Markdown
Contributor Author

Addressed by #282

@google-labs-jules
Copy link
Copy Markdown
Contributor

Addressed by #282

Understood. Acknowledging that this work is now obsolete as it was addressed by #282, and stopping work on this task.

